d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning

code:citation

Zhao, S., Gupta, D., Zheng, Q., & Grover, A. (2025). d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning. arXiv:2504.12216.

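Shared under CC BY 4.0

Canonical URL: https://creativecommons.org/licenses/by/4.0/

table:Original license: CC BY 4.0

Attribution: given in the citation and URL above

Modifications: the body text was split algorithmically (the quoted parts)

Technical measures (that would restrict distribution): none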

ーーー

2025/4/19 09:26

original：/tomiokario-close/d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning

ーーー

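Notes

This paper proposes d1, a method that applies supervised fine-tuning (SFT) followed by reinforcement learning (RL), the training recipe used for DeepSeek R1, to diffusion language models (dLLMs).

Applying d1 to the LLaDA-8B-Instruct model yields performance that exceeds the original model (Figure 1, upper excerpt).

Running task-specific reinforcement learning (GRPO) after SFT yields performance approaching that of Qwen2.5 7B, a strong autoregressive (AR) model (Figure 4, lower excerpt).

The paper also gives an organized overview of diffusion language models and SFT+RL training, so it is useful as study material on those topics.

Excerpted results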

https://gyazo.com/3fd61857937dc7a1ea7bdbf8fe7af5e5

Figure 1: Across four math and logical reasoning tasks, d1-LLaDA, which undergoes SFT followed by our proposed diffu-GRPO, consistently outperforms the base LLaDA-8B-Instruct model.

図1：4つの数学的および論理的推論タスクにおいて，SFTの後に我々の提案するdiffu-GRPOを行うd1-LLaDAは，ベースとなるLLaDA-8B-Instructモデルを一貫して上回る．

We report results using the best performing generation sequence length for each task and model, with complete sequence length results shown in Table 1.

各タスクとモデルについて，最も性能の良い生成シーケンス長を用いた結果を報告し，完全なシーケンス長の結果を表1に示す．

https://gyazo.com/d5630aec3964b8dedc11eb14559dddcb

Figure 4: Comparison with state-of-the-art dLLMs and AR LLMs of similar size.

図4：最先端のdLLMおよび同規模のAR LLMとの比較．

d1-LLaDA achieves the highest GSM8K score and the second-highest MATH500 score.

d1-LLaDAは最高のGSM8Kスコアと2番目に高いMATH500スコアを達成している．

LLaDA results are from our evaluation using the same 0-shot protocol.

LLaDAの結果は，同じ0ショットプロトコルを用いた我々の評価によるものです．

Scores for other models are from Dream (Ye et al., 2025a), using 8-shot prompts for GSM8K and 4-shot for MATH.

他のモデルのスコアはDream (Ye et al., 2025a)によるもので，GSM8Kでは8ショット，MATHでは4ショットのプロンプトを使用している．

Note that d1-LLaDA has undergone task-specific RL training for each benchmark.

d1-LLaDAは各ベンチマークに対してタスクに特化したRL学習を行っている．

abstract

Recent large language models (LLMs) have demonstrated strong reasoning capabilities that benefit from online reinforcement learning (RL).

最近の大規模言語モデル(LLM)は，オンライン強化学習(RL)の恩恵を受けた強力な推論能力を実証している．

These capabilities have primarily been demonstrated within the left-to-right autoregressive (AR) generation paradigm.

これらの能力は，主に左から右への自己回帰（AR）生成パラダイムにおいて実証されてきた．

In contrast, non-autoregressive paradigms based on diffusion generate text in a coarse-to-fine manner.

これとは対照的に，拡散に基づく非自己回帰的なパラダイムは，テキストを粗から細へ生成する．

Although recent diffusion-based large language models (dLLMs) have achieved competitive language modeling performance compared to their AR counterparts, it remains unclear if dLLMs can also leverage recent advances in LLM reasoning.

最近の拡散に基づく大規模言語モデル(dLLM)は，ARと比較して競争力のある言語モデリング性能を達成しているが，dLLMがLLM推論の最近の進歩を活用できるかどうかは不明である．

To this end, we propose d1, a framework to adapt pre-trained masked dLLMs into reasoning models via a combination of supervised finetuning (SFT) and RL.

この目的のために，我々は，教師付き微調整（SFT）とRLの組み合わせにより，事前に訓練されたマスクされたdLLMを推論モデルに適応させるフレームワークであるd1を提案する．

Specifically, we develop and extend techniques to improve reasoning in pretrained dLLMs: (a) we utilize a masked SFT technique to distill knowledge and instill self-improvement behavior directly from existing datasets, and (b) we introduce a novel critic-free, policy-gradient based RL algorithm called diffu-GRPO.

具体的には，事前学習済みdLLMの推論を改善する技術を開発・拡張する：(a)既存のデータセットから直接知識を蒸留し，自己改善行動を植え付けるために，マスクされたSFT技術を利用し，(b)diffu-GRPOと呼ばれる，クリティックを必要としない新しい政策勾配ベースのRLアルゴリズムを導入する．

Through empirical studies, we investigate the performance of different post-training recipes on multiple mathematical and logical reasoning benchmarks.

実証的研究を通じて，複数の数学的・論理的推論ベンチマークにおける，異なる訓練後レシピの性能を調査する．

We find that d1 yields the best performance and significantly improves performance of a state-of-the-art dLLM.

その結果，d1が最高の性能をもたらし，最先端のdLLMの性能を大幅に改善することがわかった．

Project page: https://dllm-reasoning.github.io/

プロジェクトページ: https://dllm-reasoning.github.io/

1. Introduction

https://gyazo.com/3fd61857937dc7a1ea7bdbf8fe7af5e5

Figure 1: Across four math and logical reasoning tasks, d1-LLaDA, which undergoes SFT followed by our proposed diffu-GRPO, consistently outperforms the base LLaDA-8B-Instruct model.

We report results using the best performing generation sequence length for each task and model, with complete sequence length results shown in Table 1.

各タスクとモデルについて，最も性能の良い生成シーケンス長を用いた結果を報告し，完全なシーケンス長の結果を表1に示す．

Recent advances in large language models (LLMs) have demonstrated remarkable capabilities across diverse applications spanning chatbots, coding, summarization, and translation (Achiam et al., 2023; Dubey et al., 2024).

近年の大規模言語モデル（LLM）の進歩は，チャットボット，コーディング，要約，翻訳など，多様なアプリケーションにおいて目覚ましい能力を発揮している（Achiam et al, 2023; Dubey et al, 2024）．

While these models typically scale through next-token prediction on vast corpora via computationally intensive pretraining, the finite availability of high-quality training data poses a fundamental scaling challenge.

これらのモデルは通常，計算集約的な事前学習によって膨大なコーパスのネクストトークンを予測することでスケーリングされるが，高品質な学習データが有限であることがスケーリングの根本的な課題となっている．

Reinforcement learning (RL) methods have emerged as a promising post-training method, enabling

強化学習(RL)手法は，有望なポストトレーニング手法として登場し，次のようなことが可能になった．

In a parallel line of work, discrete diffusion large language models (dLLMs) (Nie et al., 2025; Gong et al., 2025; Nie et al., 2024; Ye et al., 2025a) have emerged as promising nonautoregressive alternatives for language modeling.

これと並行して，離散拡散大規模言語モデル（dLLM）（Nie et al., 2025; Gong et al., 2025; Nie et al., 2024; Ye et al., 2025a）が，言語モデリングのための有望な非自己回帰的代替手法として浮上してきた．

Unlike AR models that generate text token-by-token in a causal manner, dLLMs generate text through an iterative denoising process, refining sequences over multiple steps while leveraging both past and future context via bidirectional attention.

トークンごとに因果的にテキストを生成するARモデルとは異なり，dLLMは反復的なノイズ除去プロセスを通じてテキストを生成し，双方向の注意を介して過去と未来の文脈の両方を活用しながら，複数のステップにわたってシーケンスを精緻化する．

Among them, open masked dLLMs such as LLaDA (Nie et al., 2025) have demonstrated performance comparable to similarly sized AR models, and closed-source dLLMs such as Mercury (Inception Labs et al., 2025) further demonstrate excellent inference efficiency.

その中で，LLaDA（Nie et al., 2025）のようなオープンマスクdLLMは，同規模のARモデルに匹敵する性能を実証しており，Mercury（Inception Labs et al., 2025）のようなクローズドソースdLLMは，さらに優れた推論効率を実証している．

However, leading open-source dLLMs have not undergone RL post-training, leaving this promising direction largely unexplored.

しかし，主要なオープンソースのdLLMはRLのポストトレーニングを受けておらず，この有望な方向性はほとんど未解明のままである．

This paradigm shift raises important questions about how RL post-training might be effectively realized in a non-autoregressive context.

このパラダイムシフトは，RLポストトレーニングが非自己回帰的な文脈でどのように効果的に実現されるかという重要な問題を提起している．

Adapting RL algorithms to masked dLLMs poses unique challenges because existing successful approaches for AR models, such as PPO (Schulman et al., 2017) and GRPO (Shao et al., 2024b), rely on estimating and optimizing policy distributions through computing log-probabilities of generated sequences, which cannot be directly applied to dLLMs.

RLアルゴリズムをマスクされたdLLMに適応させることには特有の課題がある．なぜなら，PPO（Schulman et al., 2017）やGRPO（Shao et al., 2024b）のようなARモデルに対する既存の成功したアプローチは，生成されたシーケンスの対数確率を計算することによってポリシー分布を推定し最適化することに依存しており，dLLMには直接適用できないからである．

While this computation is straightforward in AR models through sequential factorization, dLLMs lack this natural decomposition due to their iterative, non-sequential generation process.

この計算は，ARモデルでは逐次因数分解によって簡単に行えるが，dLLMでは反復的で非逐次的な生成プロセスのため，このような自然な分解ができない．

To bridge this gap, we propose d1, a two-stage post-training framework for enabling reasoning in masked dLLMs.

このギャップを埋めるために，我々はマスクされたdLLMの推論を可能にする2段階のポストトレーニングフレームワークであるd1を提案する．

In the first stage, the model undergoes supervised finetuning (SFT) on high-quality reasoning traces.

第一段階では，高品質な推論トレースに対して教師付き微調整(SFT)を行う．

In the RL stage, we introduce diffu-GRPO, a novel policy gradient method for masked dLLMs that builds upon GRPO with our proposed efficient one-step estimation of log-probabilities.

RL段階では，diffu-GRPOを導入する．diffu-GRPOは，マスクされたdLLMのための新しい政策勾配法であり，GRPOに我々の提案する対数確率の効率的な一段階推定を加えたものである．

Our estimator leverages random prompt masking, which acts as a form of regularization for policy optimization, allowing us to scale the number of gradient updates per batch and reducing the number of online generations required by RL training.

我々の推定器は，政策最適化のための正則化の一形態として機能するランダムプロンプトマスキングを活用することで，バッチあたりの勾配更新数を拡張することを可能にし，RL訓練に必要なオンライン生成の回数を削減する．

This substantially reduces the compute time.

これにより計算時間が大幅に短縮される．

Empirically, we instantiate d1 using LLaDA-8B-Instruct as our base model.

経験的に，LLaDA-8B-Instructをベースモデルとしてd1をインスタンス化する．

We compare d1-LLaDA performance with the base LLaDA, as well as LLaDA models trained with SFT-only and diffu-GRPO-only recipes.

d1-LLaDAの性能を，基本LLaDA，およびSFTのみ，diffu-GRPOのみのレシピで学習したLLaDAモデルと比較する．

Our experiments demonstrate that d1 consistently outperforms the base model across four mathematical and logical reasoning benchmarks, as shown in Figure 1.

我々の実験では，図1に示すように，4つの数学的および論理的推論ベンチマークにおいて，d1が一貫してベースモデルを上回ることが実証されました．

It also outperforms the SFT-only and diffu-GRPO-only methods.

また，SFTのみの手法とdiffu-GRPOのみの手法を上回る．

Additionally, we complement our primary findings with thorough ablation studies and qualitative analysis of the generated text.

さらに，徹底的なアブレーション研究と生成されたテキストの定性的分析により，主要な発見を補完する．

2. Preliminaries

2.1. Masked Diffusion Large Language Models

Masked dLLMs (Austin et al., 2021) involve a forward process that gradually corrupts a sequence of tokens x0 by the mask token.

マスクされたdLLM(Austin et al., 2021)は，マスクトークンによってトークン列x0を徐々に破損していく前進過程を含む．

The process is indexed by time t ∈ ［0, 1］.

この過程は時間t∈［0, 1］で示される．

At timestep t, the sequence xt is partially masked, where for each token the probability of remaining unmasked is αt.

タイムステップtで，シーケンスxtは部分的にマスクされ，各トークンについて，マスクされずに残る確率はαtである．

In particular, αt (a.k.a. the noise schedule) is strictly decreasing in t.

特に，αt（ノイズスケジュール）はtに対して厳密に減少する．

When t = 1, all the tokens in x1 are masked.

t = 1のとき，x1のすべてのトークンはマスクされる．
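The forward process described above can be sketched in a few lines. This is only an illustration: the linear schedule αt = 1 − t and the `[MASK]` token string are assumptions made for the example, not details stated in these notes.

```python
import random

MASK = "[MASK]"  # hypothetical mask-token string, chosen for illustration


def alpha(t: float) -> float:
    """Assumed linear noise schedule: probability that a token stays unmasked at time t."""
    return 1.0 - t


def forward_mask(tokens, t, rng):
    """Independently keep each token with probability alpha(t); otherwise replace it with MASK."""
    keep_prob = alpha(t)
    return [tok if rng.random() < keep_prob else MASK for tok in tokens]


rng = random.Random(0)
x0 = ["the", "answer", "is", "42"]
x_zero = forward_mask(x0, 0.0, rng)  # t = 0: nothing is masked
x_half = forward_mask(x0, 0.5, rng)  # t = 0.5: each token is masked with probability 0.5
x_one = forward_mask(x0, 1.0, rng)   # t = 1: every token is masked
```

Because the schedule is strictly decreasing, a larger t always means a lower chance of any given token surviving, and t = 1 yields the fully masked sequence.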

To train a masked dLLM, we begin by designing a forward process with a specific form of αt.

マスクされたdLLMを訓練するために，我々はαtの特定の形を持つ前進過程を設計することから始める．

We parameterize a bidirectional unmasking predictor fθ .

我々は双方向マスキング解除予測量 fθ をパラメータ化する．

In each iteration, we randomly sample a timestep t ∈ ［0, 1) and mask the tokens based on the designed forward process.

各反復において，タイムステップt∈［0, 1)をランダムにサンプリングし，設計された前進過程に基づいてトークンをマスキングする．

Given these corrupted inputs, the learning objective is to predict the original tokens.

これらの破損された入力が与えられると，学習の目的は元のトークンを予測することである．

The standard loss function for this task is the negative evidence lower bound (NELBO), which is an upper bound of the negative log-likelihood (NLL) of the data.

このタスクの標準的な損失関数は，データの負の対数尤度(NLL)の上界である負のエビデンス下界(NELBO)である．

For masked dLLMs, NELBO simplifies to a weighted NLL,

マスクされたdLLMの場合，NELBOは重み付きNLLに単純化される，

https://gyazo.com/7faf62afadd3c0a64549f4de9f9230df

where |xt| is the sequence length of x, and xk is the k-th token.

ここで｜xt｜はxのシーケンス長，xkはk番目のトークンである．

Note that the loss is only calculated for tokens that are masked out in timestep t.

損失はタイムステップtでマスクされたトークンに対してのみ計算されることに注意．
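For readers who cannot view the equation image, one way to write the weighted NLL in the notation of this section is given below. The symbols p_data (data distribution), q_{t|0} (forward masking process), and the weight w(t) are labels introduced here for the sketch; the exact form in the paper should be checked against the image above.

```latex
\mathcal{L}_{\mathrm{NELBO}}(\theta)
  = -\,\mathbb{E}_{t \sim \mathcal{U}[0,1),\; x_0 \sim p_{\mathrm{data}},\; x_t \sim q_{t|0}(x_t \mid x_0)}
    \left[ w(t) \sum_{k=1}^{|x_t|} \mathbb{1}\!\left[x_t^k = \text{mask}\right]
           \log f_\theta\!\left(x_0^k \mid x_t\right) \right],
\qquad
w(t) = -\frac{\alpha_t'}{1 - \alpha_t} > 0 .
```

Since αt is strictly decreasing, α′t is negative and the weight w(t) is positive. Under the assumed linear schedule αt = 1 − t, the weight reduces to w(t) = 1/t. The indicator restricts the sum to positions that are masked at timestep t, matching the note above.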

The key difference between masked dLLMs and BERT (Devlin et al., 2019) is that the latter uses a fixed masking ratio and the decoding is a single-step infilling process, whereas masked dLLMs use time-varying masking ratios and the decoding process involves multiple steps starting from pure noise and thus resulting in a generative model.

マスキングされたdLLMとBERT（Devlin et al., 2019）の主な違いは，後者が固定のマスキング比を使用し，デコーディングがシングルステップのインフィリングプロセスであるのに対し，マスキングされたdLLMは時間変化するマスキング比を使用し，デコーディングプロセスは純粋なノイズから始まる複数のステップを含み，その結果，生成モデルになることである．

Further details about the formulation of masked dLLMs are deferred to Appendix A.

マスキングされたdLLMの定式化についての詳細は付録Aに譲る．

2.2. Group Relative Policy Optimization for Large Language Models

（中略）

3. d1: Adapting Pre-trained Masked dLLMs to Reasoning Models

We propose d1, a two-stage framework that enhances the reasoning performance of pre-trained masked dLLMs by sequentially combining SFT and online RL.

我々は，SFTとオンラインRLを順次組み合わせることにより，事前に訓練されたマスクされたdLLMの推論性能を向上させる2段階のフレームワークであるd1を提案する．

performance of offline trained language model (Shao et al., 2024b; Guo et al., 2025; Team et al., 2025).

オフラインで学習された言語モデルの性能 (Shao et al., 2024b; Guo et al., 2025; Team et al., 2025)．

However, the learning formulation of GRPO does not directly generalize to dLLMs.

しかし，GRPOの学習定式化はdLLMには直接一般化できない．

The objective of GRPO (3) requires computing the (log-)likelihood ratio of πθ and πθold , at both the token level (for the advantage weights) and the sequence level (for the reverse KL term).

GRPOの目的(3)はπθとπθoldの(対数)尤度比をトークン・レベル(アドバンテージ重みのため)とシーケンス・レベル(逆KL項のため)の両方で計算する必要がある．

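Generally speaking, we need to efficiently compute the per-token log-probability and the sequence log-probability of a completion o under the policy π. Autoregressive (AR) models, such as Transformers, directly model the per-token log-probabilities, and the sequence-level log-probability of o can be easily computed through the chain rule using one forward pass:

一般的に言えば，ポリシーπの下での補完oについて，トークン単位の対数確率とシーケンスの対数確率を効率的に計算する必要がある．Transformerのような自己回帰（AR）モデルはトークン単位の対数確率を直接モデル化しており，oのシーケンスレベルの対数確率は，1回のフォワードパスで連鎖律により容易に計算できる：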

https://gyazo.com/d0bd7de602373333b14e8892583b2ad7

Similarly, the KL term can be decomposed as

同様に，KL項は次のように分解できる．

https://gyazo.com/996bd0f00a3fa9ec59ee12bc6427272c

Unlike AR models, dLLMs do not adhere to sequential factorization of the sequence log-probability.

ARモデルとは異なり，dLLMはシーケンスの対数確率の逐次因数分解には従わない．
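For reference, the sequential factorization that AR models enjoy amounts to summing per-step conditional log-probabilities. The sketch below uses made-up conditional probabilities; it only illustrates the chain rule and is not tied to any particular model.

```python
import math


def sequence_log_prob(step_probs):
    """Chain rule: log pi(o | q) = sum_k log pi(o^k | q, o^{<k})."""
    return sum(math.log(p) for p in step_probs)


# Hypothetical conditional probabilities of each generated token,
# given the prompt and all previously generated tokens.
step_probs = [0.5, 0.25, 0.8]
log_p = sequence_log_prob(step_probs)  # equals log(0.5 * 0.25 * 0.8)
```

A dLLM has no fixed generation order, so there is no such list of left-to-right conditionals to sum over, which is the gap the estimator in Section 3.1 addresses.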

Meanwhile, per-token log-probabilities are also costly to compute, since the decoding process invokes the unmasking predictor fθ multiple times.

一方，デコーディングプロセスはアンマスキング予測子fθを複数回呼び出すので，トークンごとの対数確率も計算コストが高い．

As the first step, we propose an efficient log-probability estimator in Section 3.1.

最初のステップとして，セクション3.1で効率的な対数確率推定器を提案する．

Next, using these estimators, we introduce diffu-GRPO, a variant of GRPO for dLLMs in Section 3.2.

次に，これらの推定量を用いて，セクション3.2でdLLMのためのGRPOの変形であるdiffu-GRPOを紹介する．

Last, we discuss our SFT recipe in Section 3.3.

最後に，セクション3.3で我々のSFTレシピについて述べる．

3.1 Efficient Log Probability Estimation for Masked dLLMs

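For sequence log-probability, we use a mean-field approximation that decomposes it into a sum of independent per-token log-probabilities (equivalently, the sequence probability factorizes into a product of per-token probabilities).

シーケンスの対数確率については，トークンごとの独立した対数確率の和に分解する（すなわち，シーケンス確率をトークンごとの確率の積に分解する）平均場近似を用いる．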

For per-token log-probability, we introduce an estimation method that only calls fθ once.

トークン毎の対数確率については，fθを一度だけ呼び出す推定法を導入する．
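The two ideas above can be sketched together. The details of the estimator are elided in these notes, so the sketch rests on stated assumptions: the completion is fully masked, each prompt token is masked independently with probability `p_mask`, and a single call to the predictor yields a distribution for every completion position. `toy_unmasking_predictor` is a hypothetical stand-in for fθ, not a real model.

```python
import math
import random

MASK = "[MASK]"  # hypothetical mask-token string
VOCAB = ["1", "2", "3", "4"]  # tiny illustrative vocabulary


def mask_prompt(prompt, p_mask, rng):
    """Randomly mask each prompt token with probability p_mask (q -> q')."""
    return [MASK if rng.random() < p_mask else tok for tok in prompt]


def toy_unmasking_predictor(tokens):
    """Hypothetical stand-in for f_theta: one distribution over VOCAB per position."""
    visible = sum(tok != MASK for tok in tokens)
    dists = []
    for pos in range(len(tokens)):
        favored = VOCAB[(pos + visible) % len(VOCAB)]
        dists.append({v: (0.7 if v == favored else 0.1) for v in VOCAB})
    return dists


def estimate_log_probs(prompt, completion, p_mask, rng, predictor):
    """One predictor call -> per-token log-probs; their sum is the mean-field sequence log-prob."""
    q_prime = mask_prompt(prompt, p_mask, rng)
    model_input = q_prime + [MASK] * len(completion)  # completion fully masked (assumption)
    dists = predictor(model_input)  # the only call to the predictor
    offset = len(q_prime)
    per_token = [math.log(dists[offset + k][tok]) for k, tok in enumerate(completion)]
    return per_token, sum(per_token)


calls = {"n": 0}


def counting_predictor(tokens):
    """Wrapper that counts how often the predictor is invoked."""
    calls["n"] += 1
    return toy_unmasking_predictor(tokens)


per_token, seq = estimate_log_probs(
    ["2", "+", "2", "="], ["1", "2"], 0.15, random.Random(0), counting_predictor
)
```

The call counter makes the cost argument concrete: the estimate needs one forward pass regardless of how many denoising steps produced the completion.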

（中略）

3.2 diffu-GRPO: Policy Gradient Optimization for Masked dLLMs

Using the log-probability estimators proposed in Section 3.1, we extend GRPO to masked dLLMs.

セクション3.1で提案された対数確率推定量を用いて，GRPOをマスクされたdLLMに拡張する．

Let φ^πθ(o^k | q′) and φ^πθ(o | q′) denote the estimated per-token and sequence probabilities for πθ.

ここで，φ^πθ(o^k｜q′)とφ^πθ(o｜q′)は，πθのトークン毎の確率とシーケンス確率の推定値を表すとする．

We derive the loss function of diffu-GRPO,

diffu-GRPO の損失関数を導出する，

https://gyazo.com/89d0f2b6259ba52fa6de7344b0304825

Our algorithm is summarized in Algorithm 1.

我々のアルゴリズムはアルゴリズム1に要約されている．
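Since the loss appears only as an image here, the sketch below shows the general shape of a GRPO-style clipped surrogate computed from estimated per-token log-probabilities. Several things are assumptions for illustration: advantages are taken as reward minus the group mean, the clipping range `eps = 0.2` is arbitrary, and the KL penalty is omitted. It should not be read as the exact diffu-GRPO objective.

```python
import math


def group_relative_advantages(rewards):
    """Assumed advantage: each completion's reward minus the group mean."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def clipped_token_objective(logp_new, logp_old, advantage, eps=0.2):
    """PPO/GRPO-style clipped term for one token."""
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)


def clipped_surrogate(logps_new, logps_old, rewards, eps=0.2):
    """Group average of the length-normalised clipped objective (KL term omitted)."""
    advantages = group_relative_advantages(rewards)
    total = 0.0
    for lp_new, lp_old, adv in zip(logps_new, logps_old, advantages):
        terms = [clipped_token_objective(n, o, adv, eps) for n, o in zip(lp_new, lp_old)]
        total += sum(terms) / len(terms)
    return total / len(rewards)


# Hypothetical per-token log-prob estimates for a group of two completions.
old = [[-1.0, -2.0], [-0.5, -0.7, -0.9]]
value_at_old_policy = clipped_surrogate(old, old, rewards=[1.0, 0.0])
```

When the new and old policies coincide, every ratio is 1 and the surrogate equals the mean advantage, which is zero for mean-centred advantages. The gradient signal comes from how the ratios move away from 1 during the inner updates.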

To efficiently optimize the policy loss, in practice, on-policy RL algorithms such as PPO and GRPO perform multiple gradient updates for each batch of samples.

ポリシーの損失を効率的に最適化するために，実際には，PPO や GRPO のようなオンポリシー RL アルゴリズムは，サンプルの各バッチに対して複数の勾配更新を実行する．

During these updates, the prompt q, completions {o_i}_{i=1}^G, old policy πθold, and advantages A_i^k(πθold) are kept fixed.

これらの更新の間，プロンプトq，補完{o_i}_{i=1}^G，古いポリシーπθold，およびアドバンテージA_i^k(πθold)は固定されたままである．

However, determining the optimal number of gradient updates per batch is challenging.

しかし，バッチごとの最適な勾配更新数を決定することは困難である．

If the number is too high, it can lead to overfitting within the batch, while a number that is too low slows down convergence.

数が多すぎるとバッチ内のオーバーフィッティングにつながり，逆に少なすぎると収束が遅くなる．

Achieving a balance between outer batch iterations and inner gradient updates is crucial for sample efficiency.

外側のバッチ反復と内側の勾配更新のバランスをとることは，サンプルの効率にとって非常に重要である．

Besides, every outer batch iteration requires sampling completions through iterative denoising steps, which incurs high computational cost.

その上，全ての外側バッチ反復は，反復的ノイズ除去ステップによる補完のサンプリングを必要とし，高い計算コストが発生する．

Interestingly, our log-probability estimator offers a unique mitigation to this dilemma.

興味深いことに，我々の対数確率推定器はこのジレンマをユニークに緩和する．

For each gradient update step, we randomly mask the prompt q to q′ to estimate the log-probabilities.

各勾配更新ステップにおいて，対数確率を推定するために，プロンプトqをランダムにq′にマスクする．

Intuitively, this stochastic masking introduces perturbed views of the same (prompt, completion) pairs, serving as a form of regularization for policy optimization.

直観的には，この確率的マスキングは，同じ（プロンプト，完了）ペアの摂動ビューを導入し，ポリシー最適化のための正則化の一形態として機能する．

It can also be viewed as a form of data augmentation, extracting more supervision signals from the same data.

また，同じデータからより多くの監督シグナルを抽出する，データ増強の一形態と見なすこともできる．

Empirically, we found that this approach, unique to masked diffusion models, allows us to scale μ, the number of gradient updates per batch, to higher values while maintaining stable learning dynamics.

経験的に，マスクされた拡散モデル特有のこのアプローチにより，安定した学習ダイナミクスを維持しながら，μ（バッチあたりの勾配更新数）をより大きな値に拡張できることがわかった．

As a consequence, it reduces the number of outer batch iterations required for convergence, which in turn decreases the number of online generations needed and ultimately results in significantly lower computational cost.

その結果，収束に必要な外側のバッチ反復の回数が減り，必要なオンライン生成の回数が減り，最終的に計算コストが大幅に削減される．

As shown in Figure 5, training with higher values of μ achieves the same reward performance in substantially less wall clock time.

図5に示すように，μをより大きな値で学習すると，同じ報酬性能を大幅に少ない実時間（ウォールクロック時間）で達成することができる．
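The cost argument can be made concrete with simple arithmetic. The sketch assumes, as a simplification, that convergence needs a fixed total number of gradient updates regardless of μ; the budget of 2400 is a made-up figure. Under that assumption, each outer iteration (one round of online generation) is reused for μ inner updates.

```python
import math


def generation_rounds(total_gradient_updates: int, mu: int) -> int:
    """Number of outer iterations (online generation rounds) for a fixed update budget."""
    return math.ceil(total_gradient_updates / mu)


budget = 2400  # hypothetical total number of gradient updates
rounds = {mu: generation_rounds(budget, mu) for mu in (2, 12, 24)}
# rounds == {2: 1200, 12: 200, 24: 100}
```

In practice the number of updates needed may itself depend on μ, so this is only a rough picture of why fewer generation rounds translate into less wall-clock time.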

3.3. Supervised FineTuning with Reasoning Data

We perform SFT of LLaDA on s1K (Muennighoff et al., 2025), a curated dataset consisting of 1000 high-quality reasoning questions.

質の高い推論問題1000問からなるキュレーションデータセットs1K（Muennighoff et al., 2025）上でLLaDAのSFTを行う．

The reasoning traces in s1K exhibit detailed step-by-step problem-solving processes, including verification of intermediate results and backtracking when encountering errors or dead ends.

s1Kの推論トレースには，中間結果の検証や，エラーや行き詰まりに遭遇した場合のバックトラックを含む，詳細なステップバイステップの問題解決プロセスが示されている．

The SFT algorithm is summarized in Algorithm 2, where tokens are randomly masked during training according to a time-varying schedule.

SFTアルゴリズムはアルゴリズム2に要約されており，トークンは時間変化するスケジュールに従って訓練中にランダムにマスクされる．

The model is optimized to predict the original tokens given their context.

モデルは，文脈から元のトークンを予測するように最適化される．

We find that for SFT to work effectively in practice, various design choices must be carefully considered, whose details are discussed in Appendix B.2.

SFTが実際に効果的に機能するためには，様々な設計上の選択を注意深く考慮する必要があり，その詳細は付録B.2で議論する．

4. Experiments

To understand how reasoning capabilities can be scaled in masked dLLMs through training adaptations, we conduct comprehensive experiments to answer the following research questions:

マスクされたdLLMにおいて，訓練適応によって推論能力をどのように拡張できるかを理解するために，以下の研究課題に答える包括的な実験を行う．

(1) How do SFT on reasoning traces and applying diffu-GRPO independently improve LLaDA’s reasoning capabilities?

(1)推論トレースに対するSFTとdiffu-GRPOの適用が，それぞれ独立にLLaDAの推論能力をどのように向上させるか？

(2) What additional gains can be achieved by combining them to create d1-LLaDA?

(2)これらを組み合わせてd1-LLaDAを作成することで，どのような効果が得られるか？

(3) Design Choices of diffu-GRPO: How does the proposed log probability estimation with randomized masking in diffu-GRPO and the masking probability pmask affect training efficiency and stability?

(3)diffu-GRPOの設計上の選択：diffu-GRPOで提案されているランダム化マスキングを用いた対数確率推定とマスキング確率pmaskは学習効率と安定性にどのような影響を与えるか？

4.1 Models, Tasks and Setups

Models

We employ LLaDA-8B-Instruct (Nie et al., 2025), a state-of-the-art open-sourced dLLM that has not undergone post-training, as our primary experimental testbed and baseline.

ポストトレーニングを受けていない最先端のオープンソースdLLMであるLLaDA-8B-Instruct（Nie et al., 2025）を，主要な実験用テストベッドおよびベースラインとして採用する．

We apply 3 post-training recipes to LLaDA-8B-Instruct: (a) SFT, (b) diffu-GRPO, (c) d1: applying diffu-GRPO on the checkpoint after SFT, where we refer to them as LLaDA+SFT, LLaDA+diffu-GRPO, and d1-LLaDA, respectively.

LLaDA-8B-Instructに3つのポストトレーニングレシピを適用する：(a)SFT，(b)diffu-GRPO，(c)d1（SFT後のチェックポイントにdiffu-GRPOを適用）．これらをそれぞれLLaDA+SFT，LLaDA+diffu-GRPO，d1-LLaDAと呼ぶ．

Tasks

We conduct experiments on four reasoning tasks in two categories:

次の2つのカテゴリーに分類される4つの推論タスクについて実験を行う．

(1) Mathematical reasoning: we use GSM8K (Cobbe et al., 2021), a dataset of multi-step grade school math problems, and MATH500 (Lightman et al., 2023), a curated subset of 500 problems drawn from the MATH dataset (Hendrycks et al., 2021) comprising high-school competition math problems;

(1) 数学的推論：多段階の小学校の数学問題のデータセットであるGSM8K (Cobbe et al., 2021)と，高校生の競技数学問題からなるMATHデータセット(Hendrycks et al., 2021)から抽出された500の問題のキュレーションされたサブセットであるMATH500 (Lightman et al., 2023)を使用する．

(2) Logical reasoning: this includes two tasks: 4x4 Sudoku puzzles, which require constraint satisfaction and systematic elimination to fill a grid with numbers; and Countdown with 3 numbers, a combinatorial arithmetic game in which models must reach target numbers using basic arithmetic operations on a given set of numbers.

(2) 論理的推論：これには2つの課題が含まれる

4x4の数独パズルは，制約充足と系統的な消去を必要とし，格子を数字で埋める．

3つの数字を使ったカウントダウンは，組み合わせ算術ゲームであり，モデルは与えられた数字の集合に対する基本的な算術演算を使って目標数字に到達しなければならない．

All tasks are evaluated in a zero-shot setting.

すべてのタスクはゼロショット設定で評価される．

Training

For SFT, we train on s1K (Muennighoff et al., 2025) for 20 epochs, with a sequence length of 4096.

SFTでは，s1K (Muennighoff et al., 2025)を用いて，シーケンス長4096で20エポック学習する．

For RL, we train a separate model for each task.

RLについては，タスクごとに別々のモデルを訓練する．

More specifically, for GSM8K, MATH500, we train on the training split; for Countdown and Sudoku, we train on synthetic generated datasets.

より具体的には，GSM8KとMATH500については訓練分割で訓練し，カウントダウンと数独については合成生成データセットで訓練する．

We use a composed reward function that combines both formatting and correctness rewards.

我々は，フォーマット報酬と正誤報酬の両方を組み合わせた報酬関数を使用する．
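A composed reward of this kind might look like the following sketch for the Countdown task. The actual reward functions are deferred to Appendix B and are not reproduced in these notes, so everything specific here is hypothetical: the `<answer>` tag format, the 0.1 and 1.0 weights, and the rule that each given number must be used exactly once.

```python
import ast
import operator
import re

_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _evaluate(node, used):
    """Evaluate an arithmetic AST restricted to integer literals and + - * /."""
    if isinstance(node, ast.Constant) and type(node.value) is int:
        used.append(node.value)
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_evaluate(node.left, used), _evaluate(node.right, used))
    raise ValueError("unsupported expression")


def countdown_correct(expression, numbers, target):
    """True if the expression uses exactly the given numbers and evaluates to the target."""
    try:
        used = []
        value = _evaluate(ast.parse(expression, mode="eval").body, used)
    except (ValueError, SyntaxError, ZeroDivisionError):
        return False
    return sorted(used) == sorted(numbers) and abs(value - target) < 1e-9


def composed_reward(completion, numbers, target):
    """Hypothetical weights: 0.1 for the expected answer format, 1.0 for a correct answer."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    format_reward = 0.1 if match else 0.0
    correct = match is not None and countdown_correct(match.group(1).strip(), numbers, target)
    return format_reward + (1.0 if correct else 0.0)


r_good = composed_reward("Reasoning... <answer>(3 + 5) * 2</answer>", [2, 3, 5], 16)
r_format_only = composed_reward("<answer>3 + 5</answer>", [2, 3, 5], 16)
r_none = composed_reward("the answer is 16", [2, 3, 5], 16)
```

Keeping the formatting term small relative to the correctness term is one common design choice: it gives partial credit for well-formed output without letting format alone dominate the signal.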

Due to the heavy computational cost of online generations, we limit the generation sequence length of online generations to be 256 throughout RL training.

オンライン生成は計算コストが大きいため，RL学習を通してオンライン生成の生成シーケンス長を256に制限する．

Other hyperparameters of training, training and evaluation datasets, reward functions, and inference setups are detailed in Appendix B.

その他のトレーニングのハイパーパラメータ，トレーニングデータセットと評価データセット，報酬関数，推論セットアップについては付録Bに詳述する．

Evaluation

For all the benchmarks, we evaluate LLaDA-8B-Instruct and LLaDA+SFT on the final checkpoint for all the tasks.

すべてのベンチマークについて，LLaDA-8B-InstructとLLaDA+SFTをすべてのタスクの最終チェックポイントで評価した．

For LLaDA+diffu-GRPO and d1-LLaDA, we evaluate every 100 steps starting from step 600 and report the best results.

LLaDA+diffu-GRPOとd1-LLaDAについては，ステップ600から100ステップごとに評価し，最良の結果を報告する．

We evaluate with generation sequence length 128, 256 and 512 separately.

生成シーケンス長128，256，512で別々に評価する．

4.2 Main Results

4.3 Design Choices and Ablations for diffu-GRPO

Benefits of Randomized Masking in Per-Token Likelihood Estimation

Our randomized masking mechanism provides significant advantages for training masked dLLMs.

我々のランダム・マスキング・メカニズムは，masked dLLMを訓練する際に大きな利点をもたらす．

As shown in Figure 5, random masking consistently outperforms fixed masking across different values of policy optimization updates (μ).

図5に示すように，ランダムマスキングは，ポリシーの最適化更新（μ）の異なる値において，一貫して固定マスキングを上回る．

While conventional approaches typically limit μ to 2 due to diminishing returns and overfitting risks, our approach enables scaling μ to much higher values (12, or even 24) while maintaining or improving performance, facilitating faster convergence of RL training.

従来のアプローチでは，収穫逓減とオーバーフィッティングの危険性から，μは通常2に制限されていたが，我々のアプローチでは，性能を維持または向上させながら，μをはるかに大きな値（12，あるいは24）にスケーリングすることができ，RL訓練の収束を促進する．

Consequently, fewer generations are needed, which in turn remarkably reduces the computational cost.

その結果，必要な生成の回数が少なくなり，計算コストが著しく削減される．

The rightmost plot demonstrates the real-world efficiency gains, where models with higher μ values achieve better correctness rewards in significantly less wall-clock time.

右端のプロットは，より高いμ値を持つモデルが，より少ないウォールクロック時間で，より良い正しさの報酬を得るという，実世界での効率向上を示している．

This efficiency stems from creating diverse views of the input data during each optimization step, allowing the model to prevent in-batch overfitting and extract more learning signal from each generation.

この効率は，各最適化ステップで入力データの多様なビューを作成することにより，モデルがバッチ内オーバーフィッティングを防ぎ，各世代からより多くの学習シグナルを抽出できることに起因する．

https://gyazo.com/b516be113e0f0101fa27a00d815aa5a2

Figure 6: Ablations of prompt masking probability (pmask) on GSM8K reward trends during training.

図6：訓練中のGSM8Kの報酬傾向に対するプロンプトマスキング確率（pmask）のアブレーション．

Lower masking probabilities (0.1, 0.3) show more stable and higher performance, while higher masking probabilities (0.5, 0.7) demonstrate increased instability particularly in later training stages.

低いマスキング確率(0.1, 0.3)はより安定した高いパフォーマンスを示すが，高いマスキング確率(0.5, 0.7)は特にトレーニングの後半で不安定性が増す．

Effect of Masking Rate on Training Stability and Performance

We further investigate the impact of the prompt masking probability pmask on model training dynamics.

さらに，プロンプトのマスキング確率pmaskがモデルの学習ダイナミクスに与える影響を調べた．

Figure 6 reveals a clear trade-off between stability and performance at different masking rates.

図6から，異なるマスキング確率における安定性と性能のトレードオフが明らかになった．

Lower masking probabilities provide more consistent context information during training, resulting in more stable learning curves and better final performance.

マスキング確率が低いほど，訓練中に一貫した文脈情報が得られるため，学習曲線が安定し，最終的な性能が向上する．

In contrast, higher masking probabilities introduce more variability.

これとは対照的に，マスキング確率が高いほどばらつきが大きくなる．

For example, pmask = 0.7 shows dramatic performance fluctuations and significant performance degradation after 3000 RL steps.

例えば，pmask = 0.7は，3000RLステップ後に劇的な性能変動と著しい性能劣化を示す．

The intermediate value pmask = 0.5 maintains reasonable performance but still shows instability in later training stages.

中間の値であるpmask = 0.5では，妥当な性能は維持されるが，後の学習段階で不安定さが見られる．

Our findings suggest that lower masking rates (pmask ≤ 0.3) achieve the optimal balance for diffu-GRPO training, providing sufficient variability to prevent overfitting while maintaining the stability necessary for performance improvements.

この結果は，より低いマスキング率（pmask ≤ 0.3）がdiffu-GRPOトレーニングの最適なバランスを達成することを示唆しており，パフォーマンス向上に必要な安定性を維持しながら，オーバーフィッティングを防ぐのに十分な変動性を提供する．

5 Related Work

Diffusion Language Models

While diffusion models have achieved remarkable success in the visual domain (Song et al., 2020; Ho et al., 2020), their application to language has been limited, partly due to text’s discrete nature.

拡散モデルは視覚領域では目覚ましい成功を収めているが（Song et al., 2020; Ho et al., 2020），テキストが離散的な性質を持っていることもあり，言語への適用は限定的であった．

Initial approaches attempted to learn continuous diffusion models over textual latents (Austin et al., 2021; Gulrajani & Hashimoto, 2023), but faced challenges with scalability and discretization.

初期のアプローチでは，テキスト潜在情報に対する連続拡散モデルの学習が試みられたが(Austin et al., 2021; Gulrajani & Hashimoto, 2023)，スケーラビリティと離散化の問題に直面した．

Masked diffusion has been established as a specific instance of discrete diffusion (Austin et al., 2021; Sahoo et al., 2024; Nie et al., 2024), with recent efforts scaling these models significantly.

マスク拡散は離散拡散の特殊な例として確立されており(Austin et al., 2021; Sahoo et al., 2024; Nie et al., 2024)，最近の取り組みではこれらのモデルを大幅に拡張している．

DiffuLLaMA (Gong et al., 2025) extended this approach by initializing masked diffusion language models with pretrained LLaMA weights.

DiffuLLaMA (Gong et al., 2025)は，事前に学習されたLLaMA重みでマスク拡散言語モデルを初期化することで，このアプローチを拡張した．

Ye et al. (2024b) explored how diffusion language models can generate chain-of-thought reasoning and handle complex reasoning tasks on smaller-scale models (Ye et al., 2024a), highlighting their advantages over autoregressive models in reversal tasks, though their traces lacked self-correction capabilities.

Yeら(2024b)は，拡散言語モデルがどのように思考連鎖推論を生成できるか，また，より小規模なモデル(Yeら, 2024a)で複雑な推論タスクを生成できるかを探求し，そのトレースには自己修正機能がないものの，反転タスクにおける自己回帰モデルに対する優位性を強調した．

Arriola et al. (2025) proposed Block Diffusion, a hybrid approach that models sequences block-by-block while applying diffusion within each block, allowing flexible length generation and improving inference efficiency with KV caching.

Arriolaら(2025)はブロック拡散を提案した．ブロック拡散とは，ブロックごとにシーケンスをモデル化し，各ブロック内で拡散を適用するハイブリッドアプローチであり，柔軟な長さの生成を可能にし，kvキャッシュにより推論効率を向上させる．

Recently, LLaDA (Nie et al., 2025) and Dream (Ye et al., 2025a) demonstrated that large diffusion language models can achieve performance comparable to similarly sized autoregressive alternatives, but have not yet been enhanced through reinforcement learning.

最近，LLaDA(Nie et al., 2025)とDream(Ye et al., 2025a)は，大規模拡散言語モデルが，同規模の自己回帰モデルに匹敵する性能を達成できることを示したが，強化学習による強化はまだ行われていない．

To the best of our knowledge, we are the first to demonstrate the efficacy of policy gradient-based reinforcement learning algorithms on large diffusion language models.

我々の知る限り，大規模拡散言語モデルにおける政策勾配ベースの強化学習アルゴリズムの有効性を実証したのは，我々が初めてである．

Improving Reasoning Abilities of LLMs through SFT and RL

Approaches to enhance reasoning capabilities in large language models generally fall into two categories: supervised finetuning and reinforcement learning.

大規模言語モデルの推論能力を向上させるアプローチは，一般的に，教師ありの微調整と強化学習の2つのカテゴリに分類される．

SFT on high-quality reasoning traces (Yu et al., 2023; LI et al., 2024; Paster et al., 2023) has shown promising results, while fewer but carefully curated reasoning datasets (Ye et al., 2025b; Muennighoff et al., 2025; Zhou et al., 2023) can outperform larger datasets.

高品質な推論トレースに対するSFT（Yu et al., 2023; LI et al., 2024; Paster et al., 2023）は有望な結果を示しているが，少ないが注意深くキュレーションされた推論データセット（Ye et al., 2025b; Muennighoff et al., 2025; Zhou et al., 2023）は，より大規模なデータセットを上回ることができる．

Chu et al. (2025) demonstrate that SFT-based reasoning often relies on memorization rather than generalization, while RL methods achieve better transfer to novel scenarios, particularly when intermediate reasoning steps are difficult to supervise.

Chuら(2025)は，SFTベースの推論がしばしば汎化ではなく記憶に依存する一方，RL手法は，特に中間推論ステップの監視が困難な場合に，新規シナリオへのより良い移行を達成することを実証している．

Recently, algorithms like GRPO (Shao et al., 2024a) enable efficient training by estimating advantages from group scores without requiring additional critic models as in PPO.

最近，GRPO (Shao et al., 2024a)のようなアルゴリズムは，PPOのように批判モデルを追加することなく，グループスコアから利点を推定することで，効率的な学習を可能にする．

Guo et al. (2025) demonstrate that strong reasoning capabilities can emerge through RL even without SFT (DeepSeek-R1-Zero), producing long reasoning traces with self-reflection and verification steps that significantly improve performance on mathematical tasks.

Guoら(2025)は，SFT(DeepSeek-R1-Zero)がなくても，RLによって強力な推論能力が出現することを実証し，数学的タスクのパフォーマンスを大幅に向上させる自己反省と検証ステップを含む長い推論トレースを生成する．

The development of strong reasoning models like R1 has in turn sparked renewed interest in SFT for smaller models using distilled reasoning traces from these expert reasoners.

R1のような強力な推論モデルが開発されたことで，これらのエキスパート推論者から抽出された推論トレースを使用する小規模なモデルのSFTに再び関心が集まっています．

Datasets like OpenThoughts (Team, 2025) and OpenR1-Math, which contain reasoning traces from DeepSeek R1, enable smaller models to learn step-by-step problem-solving from expert demonstrations.

OpenThoughts (Team, 2025)やOpenR1-Mathのようなデータセットには，DeepSeek R1からの推論トレースが含まれており，より小さなモデルが専門家のデモンストレーションからステップバイステップの問題解決を学習することができる．

For RL in dLLMs, prior work by Zekri & Boullé (2025) proposed a policy gradient framework using concrete score matching, but it relies on gradient-flow computations and does not target masked objectives.

dLLMのRLに関しては，Zekri & Boullé (2025)による先行研究が，具体的なスコアマッチングを用いた政策勾配のフレームワークを提案しているが，これは勾配フローの計算に依存しており，マスクされた目的を対象としていない．

In contrast, our method is tailored to masked dLLMs with efficient policy gradient calculation and improved learning efficiency through random masking.

これに対して我々の手法は，効率的な政策勾配計算とランダムマスキングによる学習効率の向上により，マスキングされたdLLMを対象としている．

Our work is among the first to explore improving reasoning in diffusion-based LLMs via both SFT and RL.

我々の研究は，SFTとRLの両方を用いて拡散ベースのLLMの推論を改善することを探求した最初のものの一つである．

6. Conclusion

In this work, we explore different recipes for scaling reasoning capabilities in diffusion large language models.

本研究では，拡散大規模言語モデルにおける推論能力を拡張するための様々なレシピを探索する．

We first explore the application of SFT on reasoning datasets, which yields improved reasoning performance and reveals the emergence of "Aha moments" as generation length increases.

まず，推論データセットにSFTを適用することで，推論性能を向上させ，生成長が長くなるにつれて「アハモーメント」が出現することを明らかにする．

Furthermore, we introduce diffu-GRPO, an efficient policy gradient method specifically designed for dLLMs, which consistently outperforms SFT across multiple benchmarks.

さらに，dLLMのために特別に設計された効率的な政策勾配法であるdiffu-GRPOを導入し，複数のベンチマークにおいて一貫してSFTを上回る性能を示す．

Finally, we consolidate these findings into the d1 recipe, a two-stage training pipeline that combines SFT and diffu-GRPO.

最後に，これらの知見を，SFTとdiffu-GRPOを組み合わせた2段階の学習パイプラインであるd1レシピに統合する．

d1 delivers the most significant improvements over the baseline model, compared to SFT and diffu-GRPO alone.

d1は，SFTとdiffu-GRPO単独と比較して，ベースラインモデルに対して最も顕著な改善をもたらす．

Several promising research directions remain underexplored though, including developing efficient decoding strategies that can scale generation length to facilitate more efficient and effective RL training.

しかし，より効率的で効果的なRL学習を促進するために，世代長を拡張できる効率的な復号化戦略の開発など，いくつかの有望な研究の方向性は未解明のままである．